1 INTRODUCTION

1.1 Motivation

Coming from Vietnam, we have gained recognition of how globally important the real estate markets are, especially in such dynamic, well-developed regions such as the Seattle Metropolitan Area. The project’s objectives include exploring housing price trends in principal cities within the area. From data analysis, we hope to deliver insights that may work for local and international investors, including those from Vietnam.

The focus will be on a dataset of property sales, examining a wide range of features that influence housing prices and transaction frequency. The features in the dataset are mentioned below:

  1. Date: The variable represents the date when a house was sold, so understanding temporal trends will help us identify seasonal fluctuations in property prices and further timing investment decisions.

  2. Price: This is the housing price in the USD unit.

  3. House features: These include:

  • Bathrooms and Bedrooms: the number of rooms in the house.
  • Sqft_Living, Sqft_Lot, Sqft_Above, Sqft_Basement: The size can reflect property value and desirability. After analysis, we can understand how they impact the price.
  • Floors: Written in numbers. There are complete floors (count as 1) and incomplete floors (count as 0.5).
  • Waterfront: Written in numbers. If there is a waterfront, count as 1; otherwise, count as 0.
  • Some rating variables: View and Condition, based on buyers’ viewpoints.
  • Time variables: yr_built and yr_renovated, which can influence market values.
  • Location variables: Street, City, Statezip, which help us identify house price trends with locations.

1.2 Key Questions for Analysis

In our project analyzing house prices in the Seattle Metropolitan Area, we will answer several questions that can provide insights into buyer preferences and market trends, including transaction and house features.

  1. What was the trend for Housing Transaction and Price Distribution?

We aim to explore housing transactions, both transaction frequency and transaction values, to uncover overall market trends, including weekly and monthly fluctuations in activity. Additionally, analyzing price ranges will be one of our key focuses, since this will provide insights into the stability of the housing market during the period in question.

  1. How did the house age influence the price and the number of housing transactions? Did having a waterfront and better housing conditions lead to higher prices and the number of housing transactions?

We aim to examine the correlation among house age, price, and transaction. Besides, another factor that is in our analysis is the presence of waterfront, whether it affected customer behavior.

  1. How did square footage and the number of floors influence buyer behavior and property pricing?

We will discover whether the year in which properties were built in could significantly affect the price of the houses that have similar features (such as comparable area size or number of floors).

  1. How did square footage and the number of floors influence buyer behavior and property pricing?

The question aims to explore how the size of the living area and the number of floors impacted buyer preferences. Whether larger living spaces and higher floors led to higher prices is our focus.

  1. What is the Preferred Bath/Bed Ratio?

We desire to know the ratio between bathroom and bedroom that customers preferred. The question will help us identify buyers’ priorities regarding housing functionality and livability,

2 METHODOLOGY

2.1 Libraries

To process the given dataset, we use a set of libraries, including:

  • readr: We use the library to read the dataset (.csv file)
  • dyplr: We use the library to transform and manipulate the data, including filtering, arranging, summarizing, and grouping variables.
  • lubridate: The package is used in working with dates and times more interactively. We can simply working process with time variables thanks to the library.
  • For visualization, our team used two libraries: ggplot2 and plotly. We use ggplot2 to ensure flexibility and visualization, whereas plotly can supplement the interaction between charts and the audience.
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(lubridate)

2.2 Dataset reading and Area selection

First, we read the dataset USA Housing Dataset (1).csv. Then, we select datasets from principal cities in the Seattle Metropolitan Area, including Seattle, Tacoma, Bellevue, Kent, Everett, Renton, Auburn, Lakewood, Redmond (the classification is from Stastistical Atlas) The newly set up one is named masster_housing_city.

To better describe the house price, we used the unit “thousdand dollars” (KUSD), so we divided the initial value by 1000.

master_housing <- read_csv("USA Housing Dataset (1).csv")

# Divide the price by 1000 to use "KUSD" unit
master_housing <- master_housing %>%
  mutate(price = price / 1000)

# Round the value in price to 1 decimal place
master_housing <- master_housing %>%
  mutate(price = round(price, 1))

master_housing_city <- master_housing %>%
  filter(city %in% c("Seattle", "Tacoma", "Bellevue", "Kent", "Everett", "Renton", "Auburn", "Lakewood", "Redmond")) %>%
  group_by(city)

2.3 Transaction

2.3.1 Transaction frequency and Transaction value overtime

2.3.1.1 Transaction per days of the week

We summed up the transaction frequency and value to see on what day the transaction takes place the most by adding a new column via mutate and group_by days of the week.

# Transaction frequency per day
transactions_per_day <- master_housing_city %>%
  #mutate(day = as.Date(date)) %>%
  group_by(date) %>%
  summarise(number_of_transactions = n())

transactions_per_day_of_week <- transactions_per_day %>%
  mutate(day_of_week = weekdays(date)) %>%
  group_by(day_of_week) %>%
  summarise(total_transactions = sum(number_of_transactions)) %>%
  mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))

# Transaction value per day
total_price_by_day_of_week <- master_housing_city %>%
  mutate(day_of_week = weekdays(date)) %>%
  group_by(day_of_week) %>%
  summarise(total_price = sum(price, na.rm = TRUE))

2.3.1.2 Transaction per week

We used as.Date and group_by to group datasets by week, so we can see the trend and compare weekly transactions:

# Transaction frequency per week
transactions_per_week <- transactions_per_day %>%
  mutate(week = as.Date(cut(date, breaks = "week"))) %>%
  group_by(week) %>%
  summarise(total_transactions = sum(number_of_transactions))

# Transaction value per week
total_transaction_cost_per_week <- master_housing_city %>%
  mutate(week = floor_date(date, "week")) %>%
  group_by(week) %>%
  summarise(total_transaction_cost = sum(price, na.rm = TRUE))

2.3.2 House Price Distribution

We cleaned the data of house price, including removing outliers and price equal to 0.

# Remove outliers and zero price from the dataset
Q1 <- quantile(master_housing_city$price, 0.25)
Q3 <- quantile(master_housing_city$price, 0.75)
IQR <- Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Filter the dataset to remove outliers and zero price
cleaned_data <- master_housing_city %>%
  filter(price > 0 & price >= lower_bound & price <= upper_bound)

#clean data for each week
data_clean_each_week <- master_housing_city %>%
  mutate(week = as.Date(cut(date, breaks = "week"))) %>%
  group_by(week) %>%
  filter(price > 0 & price >= lower_bound & price <= upper_bound)

To easily determine type of chart, we calculated the mean and median of the dataset after being cleaned:

mean_price <- mean(cleaned_data$price)
median_price <- median(cleaned_data$price)

# Calculate median price per week
median_price_per_week <- data_clean_each_week %>%
  group_by(week) %>%
  summarise(median_price = median(price))

Moreover, house prices were categorized in range to see the gap more clearly.

# Categorize the price of houses
data_clean_each_week <- data_clean_each_week %>%
  mutate(price_category = case_when(
    price < 100 ~ "<100,000",
    price >= 100 & price < 400 ~ "100,000-400,000",
    price >= 400 & price < 700 ~ "400,000-700,000",
    price >= 700 & price < 1000 ~ "700,000-1,000,000",
    price >= 1000 ~ ">1,000,000"
  ))

# Count the number of houses in each price category per week
price_category_per_week <- data_clean_each_week %>%
  group_by(week, price_category) %>%
  summarise(count = n()) %>%
  ungroup()

2.4 House Features

2.4.1 House Age, Price, and Transactions

We calculated the average price by house age to support the analysis by using as.numeric to calculate age specifically:

# Calculate average price by house age
stats_by_age <- cleaned_data %>%
  mutate(year = as.numeric(format(date, "%Y")),
         age = year - yr_built) %>%
  group_by(age) %>%
  summarise(avg_price = mean(price, na.rm = TRUE))

We calculated the number of transactions for each house age and removed outliers:

transaction_counts <- master_housing %>%
  mutate(age = as.numeric(format(Sys.Date(), "%Y")) - yr_built) %>%
  group_by(age) %>%
  summarise(transaction_count = n())

Q1 <- quantile(transaction_counts$transaction_count, 0.25)
Q3 <- quantile(transaction_counts$transaction_count, 0.75)
IQR <- Q3 - Q1
transaction_counts <- transaction_counts %>%
  filter(transaction_count >= (Q1 - 1.5 * IQR) & transaction_count <= (Q3 + 1.5 * IQR))

2.4.2 Waterfront and Housing conditions

The average house prices were calculated for waterfront and non-waterfront properties (outliers were removed):

# Calculate the average house price for waterfront and non-waterfront properties
waterfront_prices <- master_housing %>%
  group_by(waterfront) %>%
  summarise(average_price = mean(price, na.rm = TRUE))

# Remove outliers using IQR method
Q1 <- quantile(waterfront_prices$average_price, 0.25)
Q3 <- quantile(waterfront_prices$average_price, 0.75)
IQR <- Q3 - Q1
waterfront_prices <- waterfront_prices %>%
  filter(average_price >= (Q1 - 1.5 * IQR) & average_price <= (Q3 + 1.5 * IQR))

2.4.2.1 House Conditions, Waterfront, and Transaction frequency

We calculated the transaction distribution based on house conditions and waterfront before removing outliers.

# Calculate the transaction distribution based on houses' condition (1-5 level) and waterfront
condition_distribution <- master_housing %>%
  group_by(condition, waterfront) %>%
  summarise(transaction_count = n())

# Remove outliers using IQR method for each condition and waterfront combination
condition_distribution <- condition_distribution %>%
  group_by(condition) %>%
  mutate(Q1 = quantile(transaction_count, 0.25),
         Q3 = quantile(transaction_count, 0.75),
         IQR = Q3 - Q1) %>%
  filter(transaction_count >= (Q1 - 1.5 * IQR) & transaction_count <= (Q3 + 1.5 * IQR)) %>%
  ungroup() %>%
  select(-Q1, -Q3, -IQR)

2.4.2.2 House Conditions and Average Price

The average price grouped by house conditions was calculated (removing outliers).

# Calculate the average house price by house condition
avg_price_by_condition <- master_housing %>%
  group_by(condition) %>%
  summarise(avg_price = mean(price, na.rm = TRUE))

# Remove outliers using IQR method for each condition
avg_price_by_condition <- avg_price_by_condition %>%
  mutate(Q1 = quantile(avg_price, 0.25),
         Q3 = quantile(avg_price, 0.75),
         IQR = Q3 - Q1) %>%
  filter(avg_price >= (Q1 - 1.5 * IQR) & avg_price <= (Q3 + 1.5 * IQR)) %>%
  select(-Q1, -Q3, -IQR)

2.4.3 House Features given House Price and Year Built

2.4.3.1 Price Category and Year Built

We divided price range into categories and filter data for selected years:

# Define price categories
master_housing <- master_housing %>%
  mutate(price_category = case_when(
    price < 100 ~ "<100",
    price >= 100 & price < 400 ~ "100-400",
    price >= 400 & price < 700 ~ "400-700",
    price >= 700 & price < 1000 ~ "700-1,000",
    price >= 1000 ~ ">1,000"
  ))

# Filter the data for the years 1954, 1974, 1994, 2014
filtered_years <- master_housing %>%
  filter(yr_built %in% c(1954, 1974, 1994, 2014))

2.4.3.2 Price Category, Year Built, and Number of Floors

We categorized price range with houses having 1 floor and 2 floors, two most common number of floors.

# Filter the data for houses built in 1954, 1974, 1994, and 2014 with 1 floor
filtered_data <- master_housing %>%
  filter(yr_built %in% c(1954, 1974, 1994, 2014) & floors == 1)

# Categorize the price of houses
filtered_data <- filtered_data %>%
  mutate(price_category = case_when(
    price < 100 ~ "<100",
    price >= 100 & price < 400 ~ "100-400",
    price >= 400 & price < 700 ~ "400-700",
    price >= 700 & price < 1000 ~ "700-1000",
    price >= 1000 ~ ">1000"
  ))

# Count the number of houses in each price category per year built
price_category_per_year_built <- filtered_data %>%
  group_by(yr_built, price_category) %>%
  summarise(count = n()) %>%
  ungroup()
# Filter the data for houses built in 1954, 1974, 1994, and 2014 with 2 floors
filtered_data_2_floors <- master_housing %>%
  filter(yr_built %in% c(1954, 1974, 1994, 2014) & floors == 2)

# Categorize the price of houses
filtered_data_2_floors <- filtered_data_2_floors %>%
  mutate(price_category = case_when(
    price < 100 ~ "<100",
    price >= 100 & price < 400 ~ "100-400",
    price >= 400 & price < 700 ~ "400-700",
    price >= 700 & price < 1000 ~ "700-1,000",
    price >= 1000 ~ ">1,000"
  ))

# Count the number of houses in each price category per year built
price_category_per_year_built_2_floors <- filtered_data_2_floors %>%
  group_by(yr_built, price_category) %>%
  summarise(count = n()) %>%
  ungroup()

2.4.3.3 Average Price, Year Built, and Square Feet of Living

We filtered data for specific time milestone to compare, categorized square of living, and calculated the average price.

# Filter the data for houses built in 1954, 1974, 1994, and 2014
filtered_data_sqft_living <- master_housing %>%
  filter(yr_built %in% c(1954, 1974, 1994, 2014))

# Categorize the sqft_living of houses
filtered_data_sqft_living <- filtered_data_sqft_living %>%
  mutate(sqft_living_category = case_when(
    sqft_living < 1000 ~ "<1000",
    sqft_living >= 1000 & sqft_living < 1500 ~ "1000-1500",
    sqft_living >= 1500 & sqft_living < 2000 ~ "1500-2000",
    sqft_living >= 2000 & sqft_living < 2500 ~ "2000-2500",
    sqft_living >= 2500 ~ ">2500"
  ))

# Calculate the average price of houses in each sqft_living category per year built
avg_price_sqft_living_category_per_year_built <- filtered_data_sqft_living %>%
  group_by(yr_built, sqft_living_category) %>%
  summarise(avg_price = mean(price, na.rm = TRUE)) %>%
  ungroup()

2.4.4 Floors, Square of Footage, and Bath/Bed Ratio

2.4.4.1 House Price and Number of Floors

We calculated how many houses per floor.

# Summarize the count of houses per number of floors
floors_count <- master_housing_city %>%
  group_by(floors) %>%
  summarise(count = n())

# Summarize the count of houses per number of floors per city
floors_count_city <- master_housing_city %>%
  group_by(city, floors) %>%
  summarise(count = n())

And we compared houses with one and two floors in each city in research.

# Combine the data for 1-floor and 2-floor houses
combined_houses <- master_housing_city %>%
  filter(floors == 1 | floors == 2) %>%
  mutate(floor_type = ifelse(floors == 1, "1-Floor", "2-Floor"))

# Summarize the median price of houses per city and floor type
combined_houses_price_city <- combined_houses %>%
  group_by(city, floor_type) %>%
  summarise(median_price = median(price, na.rm = TRUE))

2.4.4.2 Number of Houses and Square Footage of Living

We categorized square footage of living and counted how many houses were within each range.

living_ranges <- c(0, 1000, 2000, 3000, 4000, 5000, 7000, 9000, 10000, Inf)
living_labels <- c("0-1000", "1000-2000", "2000-3000", "3000-4000", "4000-5000", "5000-7000", "7000-9000", "9000-10000", "10000+")

# Create a new column in the dataframe for the lot range categories
master_housing_city <- master_housing_city %>%
  mutate(living_range = cut(sqft_living, breaks = living_ranges, labels = living_labels, right = FALSE))

# Calculate the frequency of houses in each lot range
living_range_frequency <- master_housing_city %>%
  group_by(living_range) %>%
  summarise(Frequency = n())

2.4.4.3 Median House Price by Square Footage of Living

median_price_by_living_range <- master_housing_city %>%
  group_by(living_range) %>%
  summarise(Median_Price = median(price, na.rm = TRUE))

2.4.4.4 Bathroom/Bedroom Ratio

We divided the number of baths by the bedrooms, calculated average price, and removed rows with N/A values.

# Calculate bath/bed ratio
master_housing_city <- master_housing_city %>%
  mutate(bath_bed_ratio = bathrooms / bedrooms)

# Create bins for the bath/bed ratio
master_housing_city <- master_housing_city %>%
  mutate(bath_bed_bin = cut(bath_bed_ratio, 
                             breaks = seq(0, max(master_housing_city$bath_bed_ratio, na.rm = TRUE), by = 0.2),
                             include.lowest = TRUE,
                             right = FALSE))

# Calculate average price for each bath/bed ratio bin
average_price_by_bin <- master_housing_city %>%
  group_by(bath_bed_bin) %>%
  summarize(average_price = mean(price, na.rm = TRUE))
# Remove rows with NA values in the bath_bed_bin column in the average_price_by_bin dataframe
average_price_by_bin <- average_price_by_bin %>%
  filter(!is.na(bath_bed_bin))

3 DATA VISUALIZATION

3.1 Transaction

3.1.1 Transaction frequency

3.1.1.1 Transaction frequency per day of the week

transactions_per_day_of_week_plot <- ggplotly(ggplot(transactions_per_day_of_week, aes(x = day_of_week, y = total_transactions)) +
  geom_bar(stat = "identity", color = "blue", fill = "skyblue") +
  labs(title = "Total Number of Transactions per Day of the Week",
       x = "Day of the Week",
       y = "Total Number of Transactions") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 15,         
                              face = "bold")
  ))

transactions_per_day_of_week_plot

Figure 3.1.1.1. Transaction frequency of principal cities in Seattle Metropolitan Area per days of the week.

  • Observation. Transactions increased to the peak on Tuesdays, followed by a continual decrease till the end of Friday. Meanwhile, weekends received negligible number of transactions, reconfirming our observation above.

3.1.1.2 Transaction frequency per week

transactions_per_week_plot <- ggplot(transactions_per_week, aes(x = week, y = total_transactions)) +
  geom_bar(stat = "identity", color = "blue", fill = "skyblue") +
  geom_line(color = "red", size = 1) +
  labs(title = "Total Number of Transactions per Week",
       x = "Week",
       y = "Total Number of Transactions") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 15,         
                              face = "bold")
  )

transactions_per_week_plot <- ggplotly(transactions_per_week_plot)

transactions_per_week_plot

Figure 3.1.1.2. Transaction frequency principal cities in Seattle Metropolitan Area per week.

  • Observation. Transactions reached the highest value in June, especially gradually rising till the final week of June. Another observation from the chart is that in May and June, the highest transaction volumes were noted in the final week.

3.1.2 Transaction values

3.1.2.1 Transaction values per day of the week

total_price_by_day_of_week$day_of_week <- factor(total_price_by_day_of_week$day_of_week, 
                                                 levels = c("Monday", "Tuesday", "Wednesday", 
                                                            "Thursday", "Friday", "Saturday", "Sunday"))

transaction_values_per_day_plot <- ggplotly(ggplot(total_price_by_day_of_week, aes(x = day_of_week, y = total_price)) +
  geom_bar(stat = "identity", color = "blue", fill = "skyblue") +
  labs(title = "Transaction Values per Day of the Week",
       x = "Day of the Week",
       y = "Total Price (KUSD)") +
   theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 15,         
                              face = "bold")
  ))

transaction_values_per_day_plot

Figure 3.1.2.1. Transaction values of principal cities in Seattle Metropolitan Area per days of the week (KUSD).

  • Observation. The tendency was the same as that in transaction frequency per days of the week. The highest values were transacted on Tuesdays and decreased when it came to final days of the week.

3.1.2.2 Transaction values per week

total_transaction_cost_plot <- ggplotly(ggplot(total_transaction_cost_per_week, aes(x = week, y = total_transaction_cost)) +
  geom_bar(stat = "identity", color = "blue", fill = "skyblue") +
                                          geom_line(color = "red", size = 1) +

  labs(title = "Transaction Values per Week",
       x = "Week",
       y = "Total price (KUSD)") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 15,         
                              face = "bold")
  ))

total_transaction_cost_plot

Figure 3.1.2.2. Transaction values of principal cities in Seattle Metropolitan Area per week (KUSD).

  • Observation. Total house prices peaked in the final week of each month, May and June. Apart from the final week of May, the house price continually increased weekly till the final week of June, and then witnessed a sharp fall.

3.1.3 House price distribution

histogram_price <- ggplotly(ggplot(cleaned_data, aes(x = price)) + 
  geom_histogram(aes(y = ..density..), binwidth = 30, color = "black", fill = "skyblue") +
  geom_density(color = "red", size = 1) +
  geom_vline(aes(xintercept = mean_price), color = "blue", linetype = "dashed", size = 1) +
  geom_vline(aes(xintercept = median_price), color = "green", linetype = "dashed", size = 1) +
  labs(title = "Histogram of Housing Prices (Without Outliers and Zero Price)",
       x = "Price (KUSD)",
       y = "Density") +
   theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

histogram_price

Figure 3.1.3.1. Distribution of house prices (without outliers and prices equal to 0).

  • Observation. The distribution was skewed right, with the highest density being seen between 330 KUSD and 360 KUSD. Most houses ranged from 240 KUSD and 540 KUSD, and the median price was 455 KUSD.
price_range_per_week_plot <- ggplot(data_clean_each_week, aes(x = week, y = price)) +
  geom_boxplot(fill = "skyblue", color = "blue", outlier.shape = NA) +
  geom_line(data = median_price_per_week, aes(x = week, y = median_price), color = "red", size = 1) +
  labs(title = "Price Range per Week (Without Outliers) and Median Prices",
       x = "Week",
       y = "Price (KUSD)") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold"))

price_range_per_week_plot <- ggplotly(price_range_per_week_plot)

price_range_per_week_plot

Figure 3.1.3.2. House Price Ranges (without outliers and prices equal to 0) by week.

  • Observation. The price range was stable during the transaction period, with the median falling between 385.475 KUSD and 493.875 KUSD. Apart from two first weeks of May which collected fewer datasets, the final week of June, also the week with the highest transaction value and frequency, saw the lowest median house price among others.
price_category_per_week_plot <- ggplot(price_category_per_week, aes(x = week, y = count, fill = price_category)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Houses in Each Price Category per Week",
       x = "Week",
       y = "Number of Houses",
       fill = "Price Category (KUSD)") +
 theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold"))

price_category_per_week_plot <- ggplotly(price_category_per_week_plot)

price_category_per_week_plot

Figure 3.1.3.3. House Price Categories (without outliers and prices equal to 0) by week.

  • Observation. Two most preferred categories were 100 - 400 KUSD and 400 KUSD - 700 KUSD. Customers purchased houses ranging between the latter type the most, except for the two final weeks of May and June, also weeks with the highest number of transactions monthly. Their choices were also consistent in the given duration.

3.2 House features

3.2.1 House Age, Price, and Transaction Frequency

3.2.1.1 Average House Age (Years) and Price (KUSD)

house_age_price <- ggplotly(ggplot(stats_by_age) +
  geom_line(aes(x = age, y = avg_price, color = "Average Price")) +
  geom_smooth(aes(x = age, y = avg_price, color = "Trend"), method = "loess", se = FALSE) +
  labs(title = "Average House Price by Age",
       x = "Age of House (Years)",
       y = "Average Price (KUSD)") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")) +
  scale_color_manual(values = c("Average Price" = "blue", "Trend" = "red")))

house_age_price

Figure 3.2.2.1. The relationship between house age (years) and price (KUSD).

  • Observation: The chart suggests that newer homes (0–40 years) generally have higher average prices (in the range of 450 - 600 KUSD). As homes age, their prices tend to decrease, particularly in the 40–80 year range, with the recorded trend being below 450 . The chart also shows a price resurgence for homes aged between 80 years and above, where the average price could reach its peak of 700. Overall, there was no clear trend or specific relationship between house age and price. Its trend fluctuated throughout and it depended mostly on the style design, which would be further elaborated.

3.2.1.2 House Age (Years) and Transaction Frequency

house_age_transaction_frequency <- ggplotly(ggplot(transaction_counts, aes(x = age, y = transaction_count)) +
  geom_smooth(method = "loess", size = 1, color = "blue", se = FALSE) +
  labs(title = "Number of Transactions by House Age",
       x = "House Age (years)",
       y = "Number of Transactions") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

house_age_transaction_frequency

Figure 3.2.1.2. The relationship between house age (years) and transaction frequency.

  • Observation: In the Seattle metropolitan area, the relationship between house age and the number of transactions shows a downward trend as homes age. Conversely, as homes age, they tend to see fewer transactions. Finally, the Seattle metropolitan area reflects this trend as well. With strong demands for newer properties due to the region’s tech-driven economy, buyers tend to prioritize homes that are ready for immediate occupancy, pushing down the transaction rates for older homes.

3.2.2 Waterfront and Housing Conditions

3.2.2.1 Waterfront - Transaction distribution based on houses’ condition (1-5 level)

waterfront_price <- ggplotly(ggplot(waterfront_prices, aes(x = factor(waterfront), y = average_price, fill = factor(waterfront))) +
  geom_bar(stat = "identity") +
  labs(title = "Average House Price: Waterfront vs Non-Waterfront (KUSD)",
       x = "Waterfront (0 = No, 1 = Yes)",
       y = "Average House Price (KUSD)") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

waterfront_price

Figure 3.2.2.1. Comparison of house prices (KUSD) between houses with and without waterfront.

  • Observation. The relationship between waterfront properties and house prices is significant due to the unique appeal of living by water. Homes located directly on waterfronts, whether ocean, lake, or riverfront, generally command much higher prices than their non-waterfront counterparts.

3.2.2.2 House conditions, Waterfront, and Transaction Frequency

condition_wtf_transaction <- ggplotly(ggplot(condition_distribution, aes(x = factor(condition), y = transaction_count, fill = factor(waterfront))) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Transaction Count Based on Houses' Condition and Waterfront",
       x = "Condition (1 = Poor, 5 = Excellent)",
       y = "Transaction Count",
       fill = "Waterfront (0 = No, 1 = Yes)") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

condition_wtf_transaction

Figure 3.2.2.2. Transaction frequency based on House Conditions and Waterfront.

  • Observation. From the chart, it is significant that people will prefer to purchase a house that has at least an average condition (level 3). There were hardly any housing transactions for houses that have a level of condition below level 3. On the other hand, the chart has its peak at level 3 with over 2500 transactions and decreased by half in level 4 with just over 1000 transactions. For housing that has the condition at level 5, which means an excellent condition, although they are not likely to have many transactions compared to levels 3 and 4, they still have a quite suitable amount of nearly 500 transactions.

3.2.2.3 House Conditions and Average house price

house_condition_price <- ggplotly(ggplot(avg_price_by_condition, aes(x = factor(condition), y = avg_price)) +
  geom_line(group = 1, color = "blue") +
  geom_point(color = "red") +
  geom_smooth(method = "loess", se = FALSE, color = "green") +
  labs(title = "Average House Price (KUSD) by House Condition",
       x = "Condition (1 = Poor, 5 = Excellent)",
       y = "Average Price (KUSD)") +
   theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 12,         
                              face = "bold")))

house_condition_price

**Figure 3.2.2.3. The relationship between average house price (KUSD) and conditions.

  • Observation. When examining the relationship between house prices and housing conditions, it becomes clear that better-maintained homes often sell for higher prices. Houses rated in excellent condition (condition 5) command a much higher price compared to homes in poor condition. Well-maintained homes attract more buyers, as they require less immediate investment for repairs and renovations. Moreover, homes in excellent condition often have more modern features, upgraded appliances, and better overall curb appeal, which contribute to their higher market value. This aligns with research showing that properties in better condition sell more quickly and for a premium price in competitive markets.

3.2.3 House Features given House Price and Year Built

3.2.3.1 Price Category and Year Built

ggplotly(ggplot(filtered_years, aes(x = factor(yr_built), fill = price_category)) +
          geom_bar(position = "fill") +
          labs(title = "Proportion of House Prices for Selected Years by Price Category (KUSD)",
               x = "Year Built",
               y = "Proportion",
               fill = "Price Category") +
          theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")) +
          scale_fill_brewer(palette = "Set3") +
          theme(axis.text.x = element_text(angle = 45, hjust = 1)))

Figure 3.3.3.1. Proportion of House Prices for Selected Years by Price Category (KUSD).

  • Observation. Overall, across time, once more the dataset confirms that the expected price ranges for a house were 100,000 to 400,000 and 400,000 and 700,000. This is confirmed through the record of the Median sales price of homes sold in the U.S. from 1980 to 2024 by the Federal Reserve Bank of St. Louis in 2024. In 1954, the proportion was not 1.00 due to the processing of data to eliminate any observations that contained null values.

3.2.3.2 Price Category, Year Built, and Number of Floors

ggplotly(ggplot(price_category_per_year_built, aes(x = factor(yr_built), y = count, fill = price_category)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Houses in Each Price Category by Year Built - 1 Floor (KUSD)",
       x = "Year Built",
       y = "Number of Houses",
       fill = "Price Category") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

Figure 3.3.3.2. Number of Houses in Each Price Category by Year Built - 1 Floor (KUSD).

ggplotly(ggplot(price_category_per_year_built_2_floors, aes(x = factor(yr_built), y = count, fill = price_category)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Houses in Each Price Category by Year Built - 2 Floors",
       x = "Year Built",
       y = "Number of Houses",
       fill = "Price Category") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

Figure 3.3.3.2.1. Number of Houses in Each Price Category by Year Built - 2 Floors (KUSD)

  • Observation.
  1. Specifically, when examining houses with comparable features (number of floors), it seems that during the period 1954 - 1974, houses were usually built with 1 floors, whereas 2 floors are more common after 1974. As for single-story houses, the bar chart clearly saw a big gap between houses built in 1954 and post 1974, with the price category between 100 KUSD and 400 KUSD were the most common choice.
  2. As for two-storey houses, it started to gain popularity after 1974, with the expected price varied from 100 KUSD to 700 KUSD. Also, as two-storey houses became more and more popular after their growth in 1994, the variety of price range was more diverse in 1994.

3.2.3.3 Average Price, Year Built, and Square Footage of Living

ggplotly(ggplot(avg_price_sqft_living_category_per_year_built, aes(x = factor(yr_built), y = avg_price, fill = sqft_living_category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average House Price by Sqft Living Category and Year Built (KUSD)",
       x = "Year Built",
       y = "Average Price",
       fill = "Sqft Living Category") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

Figure 3.3.3.3. Average House Price by Sqft Living Category and Year Built (KUSD).

  • Observation. We found that the 1000-1500, 1500-2000, and 2000-2500 sqft categories, the prices have generally remained more stable across the different year-built ranges; but homes built in 1954 in the >2500 sqft category had a significantly higher average price than those built in later years . We came to a conclusion that older, larger homes are valued for their unique architecture or location, and their prices are more fluctuating.

3.2.3.4 House Prices given Year Built and Waterfront Status

# Filter the data for the years 1954, 1974, 1994, 2014
filtered_years <- master_housing %>%
  filter(yr_built %in% c(1954, 1974, 1994, 2014))

# Create the interactive boxplot
ggplotly(ggplot(filtered_years, aes(x = factor(yr_built), y = price, fill = factor(waterfront, labels = c("No waterfront", "Waterfront")))) +
          geom_boxplot() +
          labs(title = "House Prices (KUSD) for Selected Years Grouped by Waterfront Status",
               x = "Year Built",
               y = "Price (KUSD)",
               fill = "Waterfront") +
          theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")) +
          scale_fill_brewer(palette = "Set3") +
          theme(axis.text.x = element_text(angle = 45, hjust = 1))
          )

Figure 3.3.3.4. House Prices (KUSD) for Selected Years Grouped by Waterfront Status.

  • Observation. Through the chart, we found that houses with waterfront are dramatically more expensive as to households with no waterfront. Non-waterfront properties have greater variability in price, with some notable outliers. This is obvious as houses with waterfront might be considered “luxury properties” or premium due to their view and location. Anothering interesting fact was that while very few houses with waterfront were recorded in the dataset, most of them were built in 1974.

3.2.4 Floors, Square Footage of Living, and Bath/Bed Ratio

3.2.4.1 House Price and Number of Floors

ggplot(floors_count, aes(x = as.factor(floors), y = count)) +
  geom_bar(stat = "identity", fill = "blue", color = "red") +
  geom_text(aes(label = count), vjust = -0.5) +
  labs(title = "Frequency of Houses by Number of Floors",
       x = "Number of Floors",
       y = "Number of Houses") +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5)) + 
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold"))
**Figure 3.2.4.1.** The relationship between house distribution and number of floors.

Figure 3.2.4.1. The relationship between house distribution and number of floors.

  • Observation. The chart illustrates the number of houses categorized by the number of floors, with the highest concentration in 1-floor and 2-floor homes. The chart reveals that single-story houses are the most frequent, with 1,195 houses, followed by 2-floor houses, numbering 850. There is a deep decrease for houses with 1.5 floors (300 houses), and much fewer homes have 2.5 floors (24 houses), 3 floors (103 houses), and 3.5 floors (2 houses).
ggplotly(ggplot(combined_houses_price_city, aes(x = reorder(city, -median_price), y = median_price, fill = floor_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Median Price (KUSD) of 1-Floor vs 2-Floor Houses in Each City",
       x = "City",
       y = "Median Price (KUSD)",
       fill = "Floor Type") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

Figure 3.2.4.1.1. Comparison of Median Price of Houses with 1 and 2 floors in Seattle Metropolitan Area cities.

  • Observation. The chart illustrates clearly that in all cities, 2-floor houses tend to have higher median prices than 1-floor houses. This trend is especially evident in affluent cities like Bellevue and Redmond, where 2-floor houses are priced significantly higher than 1-floor houses. In Bellevue, the median price of 2-floor houses exceeds 750 KUSD, while 1-floor houses are priced below 600 KUSD. Even in more suburban and affordable cities like Renton, Kent, and Auburn, 2-floor houses still hold a price premium, although the difference is less significant compared to urban centers or wealthier areas.

3.2.4.2 Number of Houses and Square Footage of Living

ggplotly(ggplot(living_range_frequency, aes(x = living_range, y = Frequency)) +
  geom_bar(stat = "identity", fill = "pink", color= "blue") +
  labs(title = "Frequency of Houses by Living Range",
       x = "Living Range (sqft)",
       y = "Frequency") + 
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

Figure 3.2.4.2. House Distributions by Square Footage of Living.

  • Observation. The chart shows the distribution of houses based on their square footage of living. As observed, 1,000-2,000 sq ft houses (medium-sized homes) are the most common, with the highest frequency of around 1,200 homes in this range. The second most frequent range is 2,000-3,000 sq ft, with a frequency of just under 900 houses. Very few houses are in the 5,000-10,000 sq ft and 10,000+ sq ft ranges, indicating that these large homes are much less common.

3.2.4.3 Median House Price by Square Footage of Living

ggplotly(ggplot(median_price_by_living_range, aes(x = living_range, y = Median_Price)) +
  geom_bar(stat = "identity", fill = "pink", color = "blue") +
  labs(title = "Median House Price by Living Range (KUSD)",
       x = "Living Range (sqft)",
       y = "Median Price (KUSD)") +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

Figure 3.2.4.3. Median House Price (KUSD) by Square Footage of Living (sqft).

  • Observation. As square footage increases, the median house price also rises. Houses in the smallest range, 0-1,000 sq ft, have the lowest median prices 150 KUSD, indicating that smaller homes are the most affordable. The median price increases steadily across the ranges, with noticeable jumps in the 3,000-4,000 sq ft and 4,000-5,000 sq ft ranges (600 KUSD - 800 KUSD), reflecting a growing demand for larger homes. The most significant spike in price occurs for houses over 10,000 sq ft, where prices soar dramatically, indicating that large luxury homes command substantially higher prices (6000 KUSD)

3.2.4.4 Bathroom/Bedroom Ratio

ggplotly(ggplot(average_price_by_bin, aes(x = bath_bed_bin, y = average_price)) +
  geom_bar(stat = "identity", fill = "pink", width = 0.7) +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "dark blue", linetype = "dashed") +
  labs(title = "Average House Price by Bath/Bed Ratio Bin with Trendline",
       x = "Bath/Bed Ratio Bin",
       y = "Average House Price") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 13)) +
  theme(
    plot.title = element_text(hjust = 0.5,   
                              color = "blue", 
                              size = 10,         
                              face = "bold")))

Figure 3.2.4.4. Average House Price by Bath/Bed Ratio Bin with Trendline.

  • Observation. From the chart, we could observe that the optimal ratio of bathroom over bedroom (between 0.6-0.8 to 1.2-1.4) was the most ideal option with the expected price that fluctuated around the central tendency of the chart (between 500 KUSD and 700 KUSD). However, the low or unnecessarily high ratio also corresponded well with very low prices or very high prices. The ratio of bathroom over bedroom and its corresponding price also reflected the comfort and conditions of the houses, which would be further elaborated in the discussion.

Note.

  1. 0.25 Bathroom: Usually just a toilet (sometimes with a sink), no shower or bathtub.

  2. 0.5 Bathroom: Contains a toilet and sink, but no shower or bathtub.

  3. 1.0 Bathroom: A full bathroom with a toilet, sink, and either a shower or bathtub.

4 DISCUSSION

4.1 Transaction

We can explain the trends for transaction frequency and transaction values during the period for several reasons. Firstly, the transaction took place the most on Tuesdays and other weekdays, apart from a drop on Fridays. This may be because buyers and sellers want to avoid delays when rectifying anything going wrong in the purchasing process (DeMarco, 2023); therefore, they tried to lower Friday transactions. On weekends, because these are not business days, not many services and banks are active, leading to a negligible of transactions. Thus, the implementation of the purchasing process and weekend inquiries will be moved to next Monday, which can explain their lower transaction frequency and values compared to Tuesdays. Secondly, transaction frequency and values peaked in the final week of May and June. A potential explanation for the observation would be lower interest savings from additional days before going to the upcoming months and better price deals because sellers tend to close on their houses by the end of the month. (Frame, 2015) The high number and volume of transactions recorded in June, were the same as what was reported in the National Association of Realtors (Nadia, 2019) or NJLux (2024). This tendency was because of the increasing demand, including families resettling for children before a new academic year and warmer weather conditions compared to winter.

For house price distribution, the consistent price range throughout the period may imply the stability of the housing market or the predictability of customer behavior. The stability was similar to the prediction compiled by The Week (2014). This also accentuates the correlation between the trends for housing’s transaction frequency and values. Regarding the weekly price range observed, a slight drop in the range of the final week of June could be reasoned by the aforementioned sellers’ striving to deal with the prices.

4.2 Year Built affecting Comparable Houses’ Prices

After finding that single-storey houses used to be preferred pre-1974 but the trend shifted immediately when two-storey houses were introduced after 1974. It can be analyzed that pre-1974 houses were often small and single-story as they might have been built for post-war families that had small incomes. The Ranch-style house is one common example that was extremely popular due to the booming post-war middle class of the 1940s to the 1970s (Salant, 2006). After 1974, house demand continued rising but the decline of land availability led to the popularity of two-story houses to optimize living areas. The Ranch-style lost its popularity was also partly due to the suburban area.

Also, the fact that older houses tended to have more fluctuating prices and were more costly due to the design they followed and the unique historical values they held. This could be explained by the historical significance and unique design of old houses that commanded more price, and it had to be big to follow the construction style (such as Queen Anne style).

Lastly, speaking of the location of houses, most houses with waterfront views were built in 1974 but not others. This might be explained by the increase in income and growth of recreational lifestyle after the war during the 1970s. A surge of middle and upper-middle class were also able to afford beachfront houses. Most importantly, the waterfront development of the government to adapt this space had also attracted people to construct houses here (Morehouse, 1981). There is no paper or common sense to back it up, and it might be attributed to the non-representative sample of the data.

4.3 Number of Floors and Square Footage of Living Space Influence Prices and Buyer Behaviors

Speaking of floors of houses, people tended to prefer houses with either one floor or two floors. People prefer these types due to their cost efficiency, ease of maintenance, and better accessibility for families with children or elderly members. Additionally, these homes offer practical space optimization, particularly in urban areas with limited land and higher construction costs, while still providing energy efficiency and privacy. (Booth (2022)), (Montoya (2021)). The obvious trend that two-story houses were more expensive than single-story houses suggests that the additional living space and functionality provided by an extra floor consistently adds value to homes, leading to higher prices. Especially in luxurious and expensive cities, the gap was more significant.

Regarding house sizes, the buying behavior in which people tend to make efficient-cost purchases could also be a justification for the popularity of mid-sized houses. These sizes are likely ideal for families and fit well within budget and space constraints (Booth (2022)). Also, the positive correlation between the house’s size and the price can also be confirmed through our findings. Thus, square footage significantly influences house prices, with larger homes commanding higher prices due to their added space, features, and appeal to wealthier buyers.

4.4 Bathroom/Bedroom Ratio, Efficiency Cost, and Ideal Choices

Properties with a low bath/bed ratio are generally perceived as less desirable, leading to lower prices. Buyers often find homes with insufficient bathrooms relative to bedrooms inconvenient, which diminishes their appeal and market value. However, a ratio of 1—one bathroom for every bedroom—represents an ideal setup for family living, enhancing overall desirability and market value. Conversely, when the ratio exceeds 1.6, there is a dramatic drop in prices. This indicates an excess of bathrooms relative to bedrooms, which buyers may perceive as unnecessary. While additional bathrooms can be seen as a benefit, the market shows diminishing returns in value for properties with more bathrooms than needed, leading to reduced buyer interest and lower prices.

4.5 House age, Waterfront, Housing Condition, and their effects on Average Price & Transaction Volumes

Younger houses tended to have higher average prices often due to modern features, energy efficiency, and lower maintenance needs. According to Realtor.com, “Newly built homes are often priced higher, but builders frequently offer financial incentives such as lower interest rates or help with closing costs, making them attractive options for buyers”. (Neumann, 2023). Within the range of 40-80 years old, the price decreased, and this can be attributed to the higher likelihood of needing updates or renovations, which can deter buyers. According to Realtor.com, “older homes in need of repairs often face lower market value, especially if major systems like plumbing or electrical wiring need to be replaced”. And lastly, houses that were more than 80 years old (which were built before the 1930s, however, saw a growth in price. This increase can be linked to historical value, architectural uniqueness, and well-preserved characteristics. According to Realtor.com, this trend might be explained by the belief that an old house tends to be more well-established and probably in a stable area.” (Neumann, 2023)

Also, newer homes typically have a higher number of transactions, which aligns with buyers’ preferences for modern amenities and lower maintenance costs. Newer homes often sell faster because they have contemporary designs, and newer appliances, and tend to require fewer immediate repairs. However, the older houses were not a very common choice. This decline may be due to older homes often needing more maintenance or upgrades, which can deter potential buyers. Older homes are often considered less desirable due to maintenance concerns, outdated designs, and the need for renovations, which contributes to a reduced volume of sales. In some cases, however, historically significant or well-preserved homes may attract a niche group of buyers who value architectural uniqueness, which could lead to some older homes still seeing sales activity despite the general decline.

In terms of luxurious properties, waterfront properties tend to have a significant premium over non-waterfront homes, often driven by factors like the quality of water access, views, and overall location. According to Vincent Diaz from Vaster (2014), “The answer lies in three key aspects: the unique aesthetic appeal, the recreational opportunities, and the exclusivity that waterfront living offers”.

Additionally, waterfront properties are seen as a limited commodity, which adds to their price premium. The quality of the water, its depth, and clarity, as well as the property’s accessibility to the water, also play important roles in determining value. Homes with direct access to the water or those with private docks tend to fetch higher prices because they provide greater recreational convenience. Moreover, environmental factors such as the risk of flooding or erosion can affect the long-term value of these properties. Buyers must weigh these risks against the inherent.

Regarding house condition, people tend to purchase houses that are in average condition (levels 3 and 4) as they are not too bad to live in or also not too expensive to afford as in level 5. As for levels 1 and 2, we can predict that their future trends will also stay the same as it is not good to purchase a house where you need to fix nearly everything before moving in.

Additionally, the pandemic has accelerated this trend, with more people seeking open space and views, leading to an increased demand for waterfront living. Knight Frank (2022) has reported that “The pandemic and successive lockdowns have put greenery and space at the top of buyers’ requirements for homes, fuelling sales and price growth in prime regional markets”. According to (Griffith, 2024), houses that underwent renovation or improvements, especially in kitchens and bathrooms, experienced a considerable price boost, reflecting buyers’ desire for move-in-ready homes. Houses in poor condition, by contrast, may struggle to sell at market value, as potential buyers factor in the cost of necessary repairs. The relationship between condition and price is particularly important in cities like Seattle, where housing inventory is limited, and buyers are willing to pay more for homes that are well-maintained or recently renovated.

5 LIMITATIONS

While working with the dataset, having acknowledged that it is a sample for studying, there were some limitations from the dataset that hindered the process of processing and analysing data to compute meaningful findings.

5.1 About the dataset

5.1.1 Non-representative data

The dataset contained a small sample. It only reported house purchases in three months of 2014 → could not fully predict the buying patterns across the years. Therefore, the findings could not be 100% guaranteed that they could generalize for the population. (only 2474 out of around 123.23 million). (Korhonen, 2024)

5.1.2 Ambiguous variables within the data set

Some variables exhibited in this dataset are not thoroughly and carefully explained - leading to ambiguity in the way we understand the variable. For example, regarding the condition, how was this data collected? And to what extent were they reliable (because we see a paradox where houses built in the same year, with comparable features, renovated houses are of lower condition than none).

5.1.3 Existence of multiple null values

While the sample size is already small, there are also null values within the data set. Although we managed to rule out the null values (N/A), it still might hinder researchers like us from understanding the broad picture of the real estate market in the USA in three months in 2014.

5.1.4 Lack of paper-supporting findings

Interesting findings such as peak transaction volumes seen on weekdays instead of our expected weekend could not be solidified by trustworthy papers - we could only give some possible explanations. However, we believed that some sort of information traced back to past events somehow made our findings meaningful.

5.2 About data processing and analysis

5.2.1 Oversimplified/or uneducated perspective

Researchers analysing this dataset (we) do not live in the USA so we might do have a comprehensive understanding of the real estate market, which would result in the negligibility of some important parameters or specific nuances that affect the relationship between house prices and house features.

5.2.2 Selection bias

The process of sampling the dataset and self-selection of some data for our analysis might be driven by our motivation to find meaningful findings. This unconsciously biased sampling method data selection and and it might not be representative of the entire US housing market.

5.2.3 Lack of thorough understanding of variables

Most researchers’ understanding of variables is based on the brief description given from the dataset rather than the deep understanding of each variable. Therefore, we might also neglect some underlying factors or essential relationships between them.

6 CONCLUSION

In conclusion, with the dataset about USA House Price in the context of recording house transactions in three months - May, June, and July of 2014 - we saw some interesting trends. Through exploring and processing the data, we chose meaningful variables and figured out the connection between them to find the answer to our research questions surrounding the dataset. We were curious about the customer behavior pattern in purchasing houses in the Seattle Metropolitan Area from May to July 2014 through examining the transaction (values and trend in time), and specific house features such as size, amenities, and location.

One of the observations was a positive correlation between house price and special amenities/features: house size, floors, bathroom, bedroom, view, condition, and house age. Also, different periods might reflect different preferences in building houses, which later affect buying behaviors. After analyzing data, we also understood the psychology of buyers when buying a home, as they tended to maximize the capacity and optimize their money - which can be seen through the popularity of houses with optimal ratio of bathroom/bedroom; or buying young houses to avoid complicated renovation. Another novel point about transactions was that people tended to make transactions on weekdays rather than weekends (which was contrary to our thoughts), and this could be explained by the fact that paperwork relating to transactions was processed during business hours.

From the findings, we suggest that it could be useful for both home buyers and home sellers. Homebuyers could maximize their investment by focusing more on efficient layouts, such as on floors or ratio of rooms to ensure a living experience while getting the best return on investment. Also, sellers might get insight into what house features could be attractive to buyers so that they could highlight those features and strategically build a price plan based on the condition and amenities of the house, such as the location of a house (with a waterfront view). In general, we believe that people who are interested in investing or selling real estate should understand the condition of the market and make data-driven decisions; as it would lead to the most beneficial outcomes for all parties.

7 REFERENCES

Booth, S. (2022, October 26). One-Story vs. Two-Story Home: Which Is Better? Realtor.com. Retrieved October 19, 2024, from https://www.realtor.com/advice/buy/one-story-vs-two-story-home-which-is-better/

Diaz, V. (2024, July 8). How much does waterfront add to property value?. Market Insights. https://blog.vaster.com/how-much-does-waterfront-add-to-property-value

DeMarco, J. (2023, November 15). The worst day to close on a house - orchard. RSS. https://orchard.com/blog/posts/worst-day-to-close-on-a-house

Frame, T. (2015, June 4). The advantages of buying a home at the end of the Month. The Seattle Times. https://www.seattletimes.com/business/real-estate/the-advantages-of-buying-a-home-at-the-end-of-the-month/

Griffith, K. (2024, September 26). The home renovations that will boost your listing price the most. Real Estate News & Insights | realtor.com®. https://www.realtor.com/advice/sell/home-renovations-boost-listing-price/

How does the summer season influence the housing market?. NJ Lux Real Estate. (2024, June 10). https://www.njlux.com/how-does-the-summer-season-influence-the-housing-market/

Korhonen, V. (2024, July 5). U.S.: Number of households 1960-2023. Statista. https://www.statista.com/statistics/183635/number-of-households-in-the-us/?fbclid=IwZXh0bgNhZW0CMTEAAR14hJDTd0NJ3s6GCLfL_GfMBZFrR3I5HImRZLEgg-E6mKrpPGR1Xu1uiMw_aem_kJwcYUfYr04aB6KZ9_CsMw

Montoya, K. (2021, March 8). One-Story vs. Two-Story Homes: How to Decide on the Right Floor Plan. Adair Homes Blog. Retrieved October 19, 2024, from https://blog.adairhomes.com/homebuilding-tips/one-story-vs.-two-story-homes-how-to-decide-on-the-right-floor-plan

Morehouse, W. (1981, April 24). American cities rediscover their historic waterfronts - The Washington Post. The Washington Post. https://www.washingtonpost.com/archive/realestate/1981/04/25/american-cities-rediscover-their-historic-waterfronts/787096bb-f6e0-42d6-a354-7952f5976da0/

Nadia, E. (2019, January 2). Seasonality in the housing market. National Association of REALTORS®. https://www.nar.realtor/blogs/economists-outlook/seasonality-in-the-housing-market

Neumann, K. D. (2023, June 13). The weird reason why brand-new homes are now cheaper than old ones-and where they are hiding. Real Estate News & Insights | realtor.com®. https://www.realtor.com/advice/buy/how-to-find-a-brand-new-house-that-costs-less-than-an-existing-home/

Overview of the Seattle area, Washington (metro area). The Demographic Statistical Atlas of the United States - Statistical Atlas. (n.d.). https://statisticalatlas.com/metro-area/Washington/Seattle/Overview#google_vignette

Staff, T. W. (2014, January 7). A quick guide to real estate in 2014. theweek. https://theweek.com/articles/453581/quick-guide-real-estate-2014

Salant, K. (2006, December 29). The Ranch, an architectural archetype forged on the frontier - The Washington Post. https://www.washingtonpost.com/archive/realestate/2006/12/30/the-ranch-an-architectural-archetype-forged-on-the-frontier/e92d7b5b-78d7-4f7f-aa5b-68eff176ca60/

Which Global City generates the highest waterfront property price premium?. Knight Frank. (2022, August 9). https://www.knightfrank.com/research/article/2022-08-09-which-global-city-generates-the-highest-waterfront-property-price-premium